Abstract



At the Educational Testing Service, the Mantel-Haenszel procedure is
used for differential item functioning (DIF) detection and the standardization
procedure is used to describe DIF. This report describes these procedures.
First, an important distinction is made between DIF and impact, pointing to the
need to compare the comparable. Then, these two contingency table DIF
procedures are described in some detail, first in terms of their own origins as
DIF procedures, and then from a common framework that points out
similarities and differences. The relationship between the Mantel-Haenszel
procedure and IRT models in general, and the Rasch model in particular, is
discussed. The utility of the standardization approach for assessing
differential distractor functioning is described. Several issues in applied DIF
analyses are discussed, including inclusion of the studied item in the
matching variable and refinement of the matching variable. Future research
topics dealing with the matching variable, the studied variable, and the group
variable are also discussed.



DIF Detection and Description: Mantel-Haenszel and Standardization
Neil J. Dorans and Paul W. Holland 



Differential item functioning (DIF) refers to a psychometric difference 
in how an item functions for two groups. DIF refers to a difference in item 
performance between two comparable groups of examinees, that is, groups 
that are matched with respect to the construct being measured by the test. The 
comparison of matched or comparable groups is critical because it is
important to distinguish differences in item functioning from
differences between groups.

In the first chapter of the Handbook of Methods for Detecting
Test Bias, Shepard (1982) defines what was then called item bias, and is now
referred to as DIF, as psychometric features of the item that can misrepresent
the competence of one group. She provides an understanding of the
meaning of DIF by presenting some conceptual definitions of the term,
including:

An item is unbiased if, for all individuals having the same 
score on a homogeneous subtest containing the item, the 
proportion of individuals getting the item correct is the same 
for each population group being considered. (Scheuneman,
1975, p. 2)

This definition by Scheuneman may be the earliest contingency table 
definition of DIF. It is the definition underlying the observed score DIF 
approaches described in this report. 

Lord (1980) provides the item response theory definition of DIF: 

If each test item in a test had exactly the same item
response function in every group, then people of the 
same ability or skill would have exactly the same chance 
of getting the item right, regardless of their group 
membership. Such a test would be completely unbiased. 
If on the other hand, an item has a different item 
response function for one group than for another, it is 
clear that the item is biased. (p. 212)

This item response theory definition underlies the DIF procedures described 
in Thissen, Steinberg and Wainer (in press). 

Thissen (1987), in his discussion of a series of papers dealing with
DIF on the Scholastic Aptitude Test (SAT) that are contained in Schmitt and
Dorans (1987), adds to these definitions by referring to DIF as:

...an expression which describes a serious threat to the validity 
of tests used to measure the aptitude of members of different 
populations or groups. Some test items may simply perform 
differently for examinees drawn from one group or another or 
they may measure "different things" for members of one group 
as opposed to members of another. Tests comparing such 
items may have reduced validity for between-group
comparison, because their scores may be indicative of a variety 
of attributes other than those the test is intended to measure. 
(p. 1) 

Statistical methods used to identify DIF are defined by Shepard (1982)
as "internal methods designed to ensure that the meaning, which
individual items attribute to the total test, is the same for all subgroups" (p.
23). A variety of methods have been used since the 1950s. Two methods
presently employed at the Educational Testing Service for DIF assessment are
the standardization approach (Dorans & Kulick, 1986) and the
Mantel-Haenszel approach (Holland & Thayer, 1988). Both procedures compare
matched or comparable groups. This report describes these two procedures in
some detail.

The structure of the report is as follows. DIF is contrasted with impact
via Simpson's paradox, which demonstrates the importance of matching in
DIF studies. Then a definition of DIF is offered. The Mantel-Haenszel (MH)
procedure is described as a statistically powerful method for detecting DIF, and
the standardization approach is described as a flexible procedure for describing
DIF. A common framework from which to view these two related
procedures is then presented. Then, the relationship between the MH 
procedure and the Rasch model under the condition that the Rasch model is 
appropriate for the data is discussed. Next, the utility of the standardization 
approach for assessing differential distractor functioning is described. Some 
issues in applied DIF analyses are discussed. Finally, future directions in DIF 
analyses are considered. 



1. DIF Not Impact 

It is important to make a distinction between DIF and impact. Impact 
refers to a difference in performance between two intact groups. Impact is 
everywhere in test and item data because individuals differ with respect to 
the developed abilities measured by items and tests, and intact groups, such as
those defined by ethnicity and gender, differ with respect to the distributions
of developed ability among their members. For example, on a typical
SAT-Mathematics item it is usually the case that Asian-Americans, as a group,
score higher than Whites, males score higher than females, and juniors and
seniors score higher than junior high school students. This difference in
performance is called impact. Frequently, impact on any given item is 
consistent with impact on other items of the same type. In fact, impact at the 
item level is frequently explained by impact across all items of similar type or 
impact at the total score level. 

In contrast to impact, which can often be explained by stable consistent 
differences in examinee ability distributions across groups, DIF refers to 
differences in item functioning after groups have been matched with respect 
to the ability or attribute that the item purportedly measures. Unlike impact, 
where differences in item performance reflect differences in overall ability 
distributions, DIF is an unexpected difference among groups of examinees 
who are supposed to be comparable with respect to the attribute measured by 
the item and test on which it appears. 

1.1 Simpson's Paradox 

Simpson's paradox (Simpson, 1951) illustrates why we should compare
the comparable, as is done in DIF analyses. The following table summarizes 
the performance of two hypothetical groups, A and B, on an imaginary item. 

              Group A                      Group B

        Nm     Ncm    Ncm/Nm         Nm     Ncm    Ncm/Nm

       400      40      .10        1000     200      .20
      1000     500      .50        1000     600      .60
      1000     900      .90         400     400     1.00

      2400    1440      .60        2400    1200      .50




This table contains four rows and six columns of numbers. The first three
columns pertain to group A, while the last three pertain to group B. The first
three rows pertain to three different ability levels ranging from the lowest to
the highest, while the fourth row sums across ability levels. (In the case of
the third and sixth columns, the sum in the fourth row is a weighted
sum.) The symbols Nm, Ncm, and Ncm/Nm refer to the number of people
at the ability level m, the number of people at ability level m who answered 
the item correctly, and the proportion at ability level m who answered the 
item correctly, respectively. 

Of the 2,400 examinees in group A, 1,440 or 60% answered the item 
correctly. In contrast, only 50%, 1,200 of 2,400, of group B answered the item 
correctly. The impact on this item is .6 - .5 = .1 in favor of group A. 

Upon closer examination, however, the ratio Ncm/Nm at each of the
three ability levels for group A is actually .1 lower than the corresponding
ratio for group B. These conditional proportions are .1, .5, and .9 for group A,
and .2, .6, and 1.0 for group B. Hence, when we compare the comparable at
each ability level m, we find that this item actually favors group B over group
A, not vice versa as suggested by impact. This contradiction between impact
and DIF is due to unequal distributions of ability in groups A and B, as seen in
the Nm columns. This imaginary item actually disadvantages group A, but
since group A is more able than group B, the overall impact suggests that the
item favors group A.
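
The arithmetic behind this reversal can be checked with a minimal Python sketch. The counts are those of the hypothetical item in the table; the variable and function names are illustrative assumptions, not part of the original report.

```python
# Counts per ability level (lowest to highest): (N_m, N_cm).
group_a = [(400, 40), (1000, 500), (1000, 900)]
group_b = [(1000, 200), (1000, 600), (400, 400)]


def overall_proportion(levels):
    """Proportion correct over all ability levels (an unmatched comparison)."""
    total_n = sum(n for n, _ in levels)
    total_correct = sum(c for _, c in levels)
    return total_correct / total_n


def conditional_proportions(levels):
    """Proportion correct within each ability level (a matched comparison)."""
    return [c / n for n, c in levels]


# Impact: difference between the intact groups, ignoring ability.
impact = overall_proportion(group_a) - overall_proportion(group_b)

# Matched differences: group A minus group B at each ability level.
matched_diffs = [
    pa - pb
    for pa, pb in zip(conditional_proportions(group_a),
                      conditional_proportions(group_b))
]

print(round(impact, 2))                      # impact favors group A
print([round(d, 2) for d in matched_diffs])  # every matched level favors group B
```

The overall difference is positive while every matched difference is negative, which is the paradox described in the text.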

Simpson's paradox has a rich history in the statistical literature (e.g., 
Blyth, 1972; Wagner, 1982; Yule, 1903). Recently, Wainer (1986) illustrated 
how this paradox affects the interpretation of changes in SAT mean scores 
over time. Simpson's paradox illustrates the importance of comparing the 
comparable. Both the standardization approach (Dorans & Kulick, 1983, 1986), 
which has been used on the SAT since 1982, and the Mantel-Haenszel 
method (Holland & Thayer, 1988), which has been used with most ETS testing 
programs since 1987, emphasize the importance of comparing the 
comparable. In practice, both approaches use equal ability as measured by 
total test score as a measure of comparability. They share a common
definition of null DIF, namely that there is no differential item functioning
between groups after they have been matched on total score. In theory, both
procedures are flexible enough to match on more than total score (see the last
portion of this report for a discussion of this issue). In practice, matching is
typically based on a single total score. 

These two DIF assessment procedures are highly related and
complement each other well. The Mantel-Haenszel procedure is a statistically
powerful technique for detecting DIF. Standardization is a very flexible, easily
understood descriptive procedure that is particularly suited for assessing
plausible and implausible explanations of DIF.

2. Mantel-Haenszel: Testing the Constant Odds Ratio Hypothesis Version of DIF

In their seminal paper, Mantel and Haenszel (1959) introduced a new
procedure for the study of matched groups. Holland (1985) and later Holland
and Thayer (1988) adapted the procedure for use in assessing differential item
functioning. This adaptation is used at the Educational Testing Service as the
primary DIF detection device. The basic data used by the MH method are in
the form of M 2-by-2 contingency tables or one large three-dimensional
2-by-2-by-M table.



2.1 The 2-by-2-by-M Contingency Table 



Under rights scoring for the items in which responses are coded as 
either correct or incorrect (including omissions), counts of rights and wrongs 
on each item can be arranged into a 2-by-2-by-M contingency table for each 
item being studied. There are two levels for group; the focal group that is the 
focus of analysis and the reference group that serves as a basis for comparison 
for the focal group. At ETS, the current practice is to do analyses in which
Whites are the reference group and Blacks, Hispanics, Asian-Americans, and
Native Americans serve as the focal groups, and analyses in which females
are the focal group and males are the reference group. There are two levels
for item response, right or wrong, and there are M score levels on the
matching variable, e.g., total score. Finally, the item being analyzed is referred
to as the studied item. The 2 (groups)-by-2 (item scores)-by-M (score levels)
contingency table for each item can be viewed in 2-by-2 slices (there are M
slices per item) as shown below:



                               Item Score

Group                     Right    Wrong    Total

Focal Group (f)           Rfm      Wfm      Nfm
Reference Group (r)       Rrm      Wrm      Nrm

Total Group (t)           Rtm      Wtm      Ntm




The null DIF hypothesis¹ for the Mantel-Haenszel method can be
expressed as

$$H_0: [R_{rm}/W_{rm}] / [R_{fm}/W_{fm}] = 1, \quad m = 1, \ldots, M,$$

or alternatively,

$$H_0: [R_{rm}/W_{rm}] = [R_{fm}/W_{fm}], \quad m = 1, \ldots, M.$$

In other words, the odds of getting the item correct at a given level of the 
matching variable is the same in both the focal group and the reference 
group, across all M levels of the matching variable. 



¹ Note that in stating hypotheses we have not distinguished between population and sample
quantities. All of our hypotheses should read as relations among the expectations of the 
indicated statistics. 

2.2 The Constant Odds Ratio Hypothesis

In their original work, Mantel and Haenszel (1959) developed a chi- 
square test of the null DIF hypothesis against a particular alternative 
hypothesis known as the constant odds ratio hypothesis. 

$$H_a: [R_{rm}/W_{rm}] = \alpha [R_{fm}/W_{fm}], \quad m = 1, \ldots, M, \quad \alpha \neq 1.$$

Note that when $\alpha = 1$, the alternative hypothesis reduces to the null DIF
hypothesis. The parameter $\alpha$ is called the common odds ratio in the M
2-by-2 tables because under $H_a$, the value of $\alpha$ is the odds ratio that is
the same for all m,

$$\alpha_m = [R_{rm}/W_{rm}] / [R_{fm}/W_{fm}] = [R_{rm} W_{fm}] / [R_{fm} W_{rm}].$$

2.3 Chi-Square Test Statistic

There is a chi-square test associated with the MH approach, namely a
test of the null hypothesis, $H_0: \alpha_m = 1$,

$$\text{MH-}\chi^2 = \left[\, \left| \sum_m R_{rm} - \sum_m E(R_{rm}) \right| - .5 \,\right]^2 / \sum_m \mathrm{Var}(R_{rm}),$$

where

$$E(R_{rm}) = E(R_{rm} \mid \alpha = 1) = N_{rm} R_{tm} / N_{tm},$$

$$\mathrm{Var}(R_{rm}) = \mathrm{Var}(R_{rm} \mid \alpha = 1) = [N_{rm} R_{tm} N_{fm} W_{tm}] / [N_{tm}^2 (N_{tm} - 1)],$$

and where the -.5 in the expression for MH-$\chi^2$ serves as a continuity
correction to improve the accuracy of the chi-square percentage points as
approximations to the observed significance levels. The quantity MH-$\chi^2$ is
approximately distributed as a chi-square with one degree of freedom.

Holland and Thayer (1988) report: 

...that a test based on MH-$\chi^2$ is the uniformly most powerful
unbiased test of $H_0$ versus $H_a$. Hence no other test can have
higher power somewhere in $H_a$ than the one based on MH-$\chi^2$
unless the other test violates the size constraint on the null
hypothesis or has lower power than the test's size somewhere
else on $H_a$. (p. 134)

In other words, the MH approach is the statistical test possessing the most
statistical power for detecting departures from the null DIF hypothesis that 
are consistent with the constant odds ratio hypothesis. 
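
A minimal Python sketch of the continuity-corrected MH chi-square follows. It assumes each score level is supplied as a tuple of counts (R_rm, W_rm, R_fm, W_fm); the function name, the tuple layout, and the example counts are illustrative assumptions rather than anything prescribed by the report.

```python
def mh_chi_square(tables):
    """Continuity-corrected Mantel-Haenszel chi-square.

    `tables` holds one (R_r, W_r, R_f, W_f) tuple per score level m:
    rights and wrongs in the reference group, then in the focal group.
    """
    sum_observed = 0.0   # sum over m of R_rm
    sum_expected = 0.0   # sum over m of E(R_rm) under alpha = 1
    sum_variance = 0.0   # sum over m of Var(R_rm) under alpha = 1
    for r_r, w_r, r_f, w_f in tables:
        n_r = r_r + w_r          # N_rm
        n_f = r_f + w_f          # N_fm
        r_t = r_r + r_f          # R_tm
        w_t = w_r + w_f          # W_tm
        n_t = n_r + n_f          # N_tm
        if n_t < 2:
            continue             # the variance term is undefined for N_tm < 2
        sum_observed += r_r
        sum_expected += n_r * r_t / n_t
        sum_variance += n_r * r_t * n_f * w_t / (n_t ** 2 * (n_t - 1))
    return (abs(sum_observed - sum_expected) - 0.5) ** 2 / sum_variance


# Hypothetical counts: three score levels with lower focal-group odds.
dif_tables = [(200, 800, 40, 360), (600, 400, 500, 500), (400, 0, 900, 100)]
# Hypothetical counts: identical odds in both groups at the single level.
null_tables = [(50, 50, 50, 50)]

print(mh_chi_square(dif_tables))
print(mh_chi_square(null_tables))
```

The resulting statistic would be referred to a chi-square distribution with one degree of freedom, whose 5% critical value is approximately 3.84.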

2.4 Estimate of Constant Odds Ratio

Mantel and Haenszel also provided an estimate of the constant odds ratio,

$$\alpha_{MH} = \left[ \sum_m R_{rm} W_{fm} / N_{tm} \right] / \left[ \sum_m R_{fm} W_{rm} / N_{tm} \right].$$

This estimate is an estimate of DIF effect size in a metric that ranges from 0 to
$\infty$, with a value of 1 indicating null DIF. This odds-ratio metric is not
particularly meaningful to test developers who are used to working with 
numbers in an item difficulty scale. In general, odds are converted to log odds 
because the latter are symmetric around zero and easier to interpret. 
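
As an illustration, the estimator can be sketched in Python as below. The per-level tuple layout (R_rm, W_rm, R_fm, W_fm), the function name, and the example counts are assumptions made for this sketch.

```python
def mh_odds_ratio(tables):
    """Mantel-Haenszel estimate of the common odds ratio.

    `tables` holds one (R_r, W_r, R_f, W_f) tuple per score level m.
    """
    numerator = 0.0
    denominator = 0.0
    for r_r, w_r, r_f, w_f in tables:
        n_t = r_r + w_r + r_f + w_f      # N_tm
        numerator += r_r * w_f / n_t     # R_rm * W_fm / N_tm
        denominator += r_f * w_r / n_t   # R_fm * W_rm / N_tm
    if denominator == 0:
        raise ValueError("denominator is zero; the estimate is undefined")
    return numerator / denominator


# Hypothetical counts for three score levels.
tables = [(200, 800, 40, 360), (600, 400, 500, 500), (400, 0, 900, 100)]
alpha_mh = mh_odds_ratio(tables)
print(alpha_mh)  # values above 1 mean higher odds in the reference group
```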

2.5 MH DIF in Item Difficulty Metrics 

At ETS, test developers are used to working with item difficulty 
estimates in the "delta metric", which has a mean of 13 and a standard 
deviation of 4. To obtain a delta, the proportion correct (p) is converted to a z- 
score via a p-to-z transformation using the inverse of the normal cumulative 
function, followed by a linear transformation to a metric with a mean of 13 
and a standard deviation of 4 via: 

$$\Delta = 13 - 4\{\Phi^{-1}(p)\},$$

such that large values of $\Delta$ correspond to difficult items, while easy items
have small values of delta. Holland and Thayer (1985) converted $\alpha_{MH}$ into a
difference in deltas via:

$$\text{MH D-DIF} = -2.35 \ln[\alpha_{MH}].$$

Note that positive values of MH D-DIF favor the focal group, while negative 
values favor the reference group. 
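
Both transformations are short enough to sketch in Python. The sketch uses the standard library's `statistics.NormalDist` for the inverse normal cumulative function; the function names are illustrative assumptions.

```python
import math
from statistics import NormalDist


def delta(p):
    """Item difficulty in the delta metric: 13 - 4 * inverse-normal(p)."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p)


def mh_d_dif(alpha_mh):
    """Convert the MH common odds ratio estimate to the delta-difference metric."""
    return -2.35 * math.log(alpha_mh)


print(delta(0.5))      # an item answered correctly by half the group sits at 13
print(delta(0.9))      # an easy item has a smaller delta
print(mh_d_dif(2.0))   # odds ratio above 1 gives a negative value
```

An odds ratio above 1 (higher odds in the reference group) maps to a negative MH D-DIF, consistent with the sign convention stated in the text.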

Another metric that is used more universally to describe item difficulty 
is the p-metric, percent correct or proportion correct metric. The $\alpha_{MH}$ can
also be expressed in this metric,

$$\text{MH P-DIF} = P_f - P_f^{\dagger},$$

where

$$P_f^{\dagger} = [\alpha_{MH} P_f] / [(1 - P_f) + \alpha_{MH} P_f],$$

which can be thought of as a predicted proportion correct in the reference
group based on the MH odds ratio, and $P_f$ is the proportion correct observed
in the focal group.
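
A minimal sketch of this conversion, assuming the predicted proportion is obtained by applying the MH odds ratio to the focal group's odds; the function names and the example values are illustrative assumptions.

```python
def predicted_reference_proportion(alpha_mh, p_f):
    """Proportion correct whose odds equal alpha_mh times the focal-group odds."""
    return alpha_mh * p_f / ((1.0 - p_f) + alpha_mh * p_f)


def mh_p_dif(alpha_mh, p_f):
    """MH DIF in the proportion-correct metric: observed minus predicted."""
    return p_f - predicted_reference_proportion(alpha_mh, p_f)


# Hypothetical values: odds ratio of 2 and half the focal group correct.
print(predicted_reference_proportion(2.0, 0.5))
print(mh_p_dif(2.0, 0.5))
```

With an odds ratio of 1 the predicted and observed proportions coincide, so the index is zero.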

2.6 Standard Error of the Mantel-Haenszel DIF Indices 

A useful, approximate standard error for the log of the Mantel- 
Haenszel odds-ratio estimator was developed by Robins, Breslow and 
Greenland (1986) and, in the equivalent form used here, by Phillips and 
Holland (1987). This expression may be multiplied by 2.35 to yield an
estimated standard error for MH D-DIF,

$$SE(\text{MH D-DIF}) = (2.35/C) \left\{ \sum_m \left[ (R_{rm} W_{fm} + \alpha_{MH} W_{rm} R_{fm}) [R_{rm} + W_{fm} + \alpha_{MH} (W_{rm} + R_{fm})] / (2 N_{tm}^2) \right] \right\}^{.5},$$

where

$$C = \sum_m R_{rm} W_{fm} / N_{tm}.$$

The standard error for MH P-DIF, derived in Holland (1989), is

$$SE(\text{MH P-DIF}) = \left\{ (1-K)^2 P_f(1-P_f)/N_f + 2K(1-K) P_f(1-P_f)/N_f + K^2 [P_f(1-P_f)]^2 [SE(\text{MH D-DIF})/2.35]^2 \right\}^{.5},$$

where

$$K = \alpha_{MH} / (1 - P_f + \alpha_{MH} P_f)^2,$$

and $N_f$ is the total number of examinees in the focal group.
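
The standard error of MH D-DIF can be sketched in Python as follows. The sketch covers only the MH D-DIF expression; the per-level tuple layout (R_rm, W_rm, R_fm, W_fm) and the function names are assumptions made for illustration.

```python
import math


def mh_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio from (R_r, W_r, R_f, W_f) tuples."""
    num = sum(r_r * w_f / (r_r + w_r + r_f + w_f) for r_r, w_r, r_f, w_f in tables)
    den = sum(r_f * w_r / (r_r + w_r + r_f + w_f) for r_r, w_r, r_f, w_f in tables)
    return num / den


def se_mh_d_dif(tables):
    """Approximate standard error of MH D-DIF (2.35 times the SE of ln alpha_MH)."""
    alpha = mh_odds_ratio(tables)
    # C = sum over m of R_rm * W_fm / N_tm
    c = sum(r_r * w_f / (r_r + w_r + r_f + w_f) for r_r, w_r, r_f, w_f in tables)
    total = 0.0
    for r_r, w_r, r_f, w_f in tables:
        n_t = r_r + w_r + r_f + w_f
        total += ((r_r * w_f + alpha * w_r * r_f)
                  * (r_r + w_f + alpha * (w_r + r_f))
                  / (2.0 * n_t ** 2))
    return (2.35 / c) * math.sqrt(total)


# Hypothetical single-level table with 50 rights and 50 wrongs in each group.
print(se_mh_d_dif([(50, 50, 50, 50)]))
```

For a single 2-by-2 table with an odds ratio of 1, the result agrees with 2.35 times the familiar large-sample standard error of a log odds ratio, the square root of the sum of reciprocal cell counts.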

2.7 ETS DIF Classification Rules

To use the MH D-DIF measure to identify test items that exhibit 
varying degrees of DIF, a classification scheme was developed at ETS for use 
in test development that puts items into one of three categories — negligible 
DIF (A), intermediate DIF (B), and large DIF (C). Items are classified as A for a 
particular combination of reference and focal groups if either MH D-DIF is not 
statistically different from zero or if the magnitude of the MH D-DIF values is 
less than one delta unit in absolute value. Items are classified as C if MH D-DIF
both exceeds 1.5 in absolute value and is statistically significantly larger
than 1.0 in absolute value. All other items are classified as category B. In both
categories A and C statistical significance is at the 5% level for a single item. 
Presently an item can have up to five different MH D-DIF values associated 
with it, one for each of five possible combinations of focal and reference 
groups. An item is currently assigned the lowest letter grade from all the DIF 
analyses performed on it. 
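
The three-category rule for a single reference/focal comparison can be sketched in Python as below. The text specifies only a 5% significance level; the use of a normal approximation, a two-sided test against zero, and a one-sided test against 1.0 are assumptions of this sketch, as are the function and variable names.

```python
from statistics import NormalDist


def classify_dif(mh_d_dif, se):
    """Return 'A', 'B', or 'C' for one MH D-DIF value and its standard error.

    Assumes normal-theory tests at the 5% level: two-sided for the
    difference from zero, one-sided for |MH D-DIF| exceeding 1.0.
    """
    z_two_sided = NormalDist().inv_cdf(0.975)
    z_one_sided = NormalDist().inv_cdf(0.95)
    magnitude = abs(mh_d_dif)
    differs_from_zero = magnitude / se > z_two_sided
    exceeds_one = (magnitude - 1.0) / se > z_one_sided

    if not differs_from_zero or magnitude < 1.0:
        return "A"   # negligible DIF
    if magnitude > 1.5 and exceeds_one:
        return "C"   # large DIF
    return "B"       # intermediate DIF


print(classify_dif(0.5, 0.2))   # small magnitude
print(classify_dif(1.2, 0.2))   # significant, between 1.0 and 1.5
print(classify_dif(-2.0, 0.3))  # large and significantly beyond 1.0
```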

2.8 The MH Procedure and The Rasch Model 

Holland and Thayer (1988) point out a close connection between
"chi-square" types of DIF procedures, such as the MH procedure, and "theoretically
preferred" methods based on item response theory models, such as those
described by Thissen, Steinberg and Wainer (in press). They draw this close
connection in fairly abstract terms using a very general class of item response 
theory models. The interested reader should consult the original source for 
the mathematical details. To make matters concrete Holland and Thayer 
show how the Rasch model and the MH procedure are related when the 
assumptions underlying the Rasch model fit the data. In particular, they 
demonstrate that under the Rasch model the constant odds ratio hypothesis 
holds exactly in the population if: (1) all items in the matching criterion, with
the possible exception of the studied item, are free of DIF; (2) the criterion for 
matching is a number-right score that includes the studied item; and (3) the 
data are random samples from the reference and focal populations. It is only 
under these special conditions, some of which are strong, particularly the 
assumption that the Rasch model fits the data, that MH and Rasch model 
have a special relationship. It is important to realize that the Holland and 
Thayer analysis does not imply that the Rasch model and MH procedure are 
always intimately related. Instead, Holland and Thayer (1988) used the MH 
procedure and the Rasch model to relate the chi-square procedures and the
item response theory procedures under special conditions. In the process,
they determined the need to include the studied item in the matching 
criterion, which has implications for DIF applications and future research, 
both of which will be discussed later. 



3. Standardization: A Flexible Method for Describing DIF 

In the early eighties, Dorans (1982) reviewed a number of item bias 
studies that had been conducted on SAT data in the late seventies. These 
studies had used the Angoff and Ford (1973) delta-plot methodology and, in 
some cases, a log-linear method. The delta-plot method can be justified from 
a one-parameter normal ogive item response theory model, and as such, is of 
as limited applicability to multiple-choice item data as the Rasch model. DIF 
detection with either the Rasch model or the delta-plot model is confounded 
with lack of model fit, a confounding that occurs frequently because items do
not have a common discrimination parameter. The log-linear approach
employed in those early SAT studies was flawed because the conditioning 
variable was too coarsely grouped, a practice we refer to as fat matching. 
Taken to its extreme, fat matching leads to a single level for the matching 
variable, which converts DIF studies into impact studies. Dorans (1982) 
concluded that a new method was needed. 

Large data sets are often associated with SAT test forms. Given large 
SAT data sets and a desire to avoid contamination caused by model misfit, 
Dorans and Kulick (1983) decided to not employ IRT models. Instead, they 
opted for an IRT-like approach that compared empirical item response curves 
in which a total score was used as an estimate of ability. Summarizing these 
numerous non-parametric item test regressions via some numerical index 
seemed to be essential if this procedure was to become practical. They were 
steered in the direction of standardization via the Alderman and Holland 
(1981) report on DIF assessment for the Test of English as a Foreign Language 
(TOEFL). 

According to the standardization method, an item is exhibiting DIF 
when the expected performance on an item differs for examinees of equal 
ability from different groups. Expected performance on an item can be 
operationalized by non-parametric item test regressions. Differences in 
empirical item test regressions are indicative of DIF. 

One of the main principles underlying the standardization approach to
DIF assessment is to use all available appropriate data to estimate the
conditional item performance of each group at each level of the matching
variable. The matching done by standardization and Mantel-Haenszel does
not require the use of stratified sampling procedures that yield equal numbers 
of examinees at a given score level across groups. In fact, throwing away data 
in this fashion just leads to poorer estimates of effect sizes that have larger 
standard errors associated with them than effect sizes based on all the data. 

The first step in the standardization analysis is to use all available data 
to estimate non-parametric item test regressions in the reference group and in 
the focal group. Let $E_f(I \mid M)$ define the empirical item test regression for the
focal group f, and let $E_r(I \mid M)$ define the empirical item test regression for the
reference group r, where I is the item score variable and M is the matching
variable. The definition of null DIF employed by the standardization approach
implies that $E_f(I \mid M) = E_r(I \mid M)$.

The most detailed definition of DIF is at the individual score level, m,

$$D_m = E_{fm} - E_{rm},$$

where Efm and Erm are realizations of the item-test regressions at score level
m. The Dm are the fundamental measures of DIF according to the 
standardization method because these quantities are differences in item 
performance between focal group and reference group members who are 
matched with respect to the attribute measured by the test. Any differences 
that exist after matching cannot be explained or accounted for by ability 
differences. These are unexpected differences as opposed to those expected 
given ability differences. Plots of these differences, as well as plots of $E_f(I \mid M)$
and $E_r(I \mid M)$, provide visual descriptions of DIF in fine detail. Figures 1 and 2
are sample plots of non-parametric item test regressions and differences for
an actual SAT item, which exhibits considerable DIF. In contrast, Figures 3
and 4 are item test regressions for an actual SAT item which exhibits minimal 
DIF. 

Visual analysis is an important component of the standardization 
approach. Figure 1 comes from the first study to use standardization to do 
DIF analyses on the SAT (Dorans & Kulick, 1983). In that study, there were 
21,209 female examinees in the focal group and 21,285 male examinees in the 
reference group. In Figure 1, Efm and Erm are presented in a percent correct 
metric, ranging from 0 to 100, while the matching variable is score on the 
familiar 200-to-800 College Board scale. Each point in the plot represents the 
conditional item mean score (under rights scoring) at each scaled score level. 
This plot and the corresponding difference plot in Figure 2 provide detailed 
visual descriptions of differences and similarities of focal and reference group
performance on the item at each of the 61 scale score levels ranging from 200
to 800 in 10-point increments.

The content for this item, which appeared on the December 1977 form 
of the SAT, reveals why there is such large DIF on this item. It is a verbal 
analogy item, 

DECOY : DUCK :: (A) net : butterfly (B) web : spider 

(C) lure : fish (D) lasso : rope (E) detour : shortcut. 
This edition of the SAT was assembled prior to the institution of the ETS Test 
Sensitivity Guidelines (see Ramsey, in press), which screen items for content 
or language that is offensive or could be detrimental to the performance of 
ethnic or gender subgroups. Had such guidelines been in place, this item may 
never have appeared in a final edition of the SAT because a casual 
examination of the item reveals that knowledge of hunting and fishing
jargon probably influences performance on this item. Sex differences with
respect to familiarity with this jargon probably account for why males
outperform matched females by 15% to 20% at each SAT-Verbal
scaled score level between 250 and 500. For example, at a scaled score level of
300, over 60% of the males answer the item correctly, while only 40% of the 
females choose the correct response option. Clearly, this is a very easy item 
for males that is somewhat harder for females, an item that exhibits 
substantial DIF, and a high DIF item that is biased against females. 

In contrast, the plots in Figures 3 and 4 depict an item that exhibits
negligible DIF. This item came from the second study that used
standardization on the SAT (Kulick & Dorans, 1983), in which father's level
of education was used to compare examinees from different socioeducational
groups. Examinees whose fathers had not completed high school (the focal
group, N = 7,053) performed on this SAT-Mathematics item in much the
same way as students whose fathers had attained at least a bachelor's degree
(the reference group, N = 24,910). Whereas the DECOY : DUCK item had atypical
DIF for its test edition, the DIF for this SAT-Mathematics item was more 
typical of items on that March 1980 form of the SAT. 

3.1 Standardization's Item Discrepancy Indices 

The sheer volume of the SAT item pool precludes sole reliance on 
item-test regression plots and difference plots for DIF assessment. There is a 
clear need for a numerical index that targets items like that depicted in 
Figures 1 and 2 for closer scrutiny, while allowing items such as that depicted 
in Figures 3 and 4 to pass swiftly through the screening process. 
Standardization has two such flags: the standardized p-difference (STD P-DIF)
and the root-mean-weighted squared difference (RMWSD). Both indices use
a weighting function supplied by the standardization group to average
differences (or squared differences) across levels of the matching variable.
The function of the standardization group, which may be a real group or a
hypothetical group, is to supply a set of weights, one for each score level, for
use in weighting each individual $D_m$ (or $D_m^2$) before accumulating these
weighted differences (or squared differences) across score levels to arrive at a 
summary item-discrepancy index. 

3.1.1 The standardized p-difference. The standardized p-difference is defined
as:

$$\text{STD P-DIF} = \sum_m w_m (E_{fm} - E_{rm}) / \sum_m w_m = \sum_m w_m D_m / \sum_m w_m,$$

where $(w_m / \sum_m w_m)$ is the weighting factor at score level m supplied by the
standardization group to weight differences in item performance between the
focal group ($E_{fm}$) and the reference group ($E_{rm}$). The standardized
p-difference is so named because the original applications of the
standardization methodology defined expected item score in terms of
proportion correct at each score level,

$$\text{STD P-DIF} = \sum_m w_m (P_{fm} - P_{rm}) / \sum_m w_m = \sum_m w_m D_m / \sum_m w_m,$$

where $P_{fm}$ and $P_{rm}$ are the proportions correct, number of examinees who
answer correctly over total number of examinees, in the focal and reference
groups at score level m,

$$P_{fm} = R_{fm}/N_{fm}; \quad P_{rm} = R_{rm}/N_{rm}.$$

In contrast to impact, in which each group has its relative frequency
serve as a weight at each score level,

$$\text{IMPACT} = P_f - P_r = \sum_m N_{fm} P_{fm} / \sum_m N_{fm} - \sum_m N_{rm} P_{rm} / \sum_m N_{rm},$$

STD P-DIF uses a standard or common weight on both $P_{fm}$ and $P_{rm}$, namely,
$(w_m / \sum_m w_m)$. The use of the same weight on both $P_{fm}$ and $P_{rm}$, or more
generally $E_{fm}$ and $E_{rm}$, is the essence of the standardization approach. In the
equation above, $P_r$ is the proportion correct observed in the reference group,
while $P_f$ is the proportion correct observed in the focal group.

The particular set of weights employed for standardization depends 
upon the purposes of the investigation. Some plausible options are the 
following: 

- $w_m = N_{tm}$, the number of examinees at m in the total group;

- $w_m = N_{rm}$, the number of examinees at m in the reference group;

- $w_m = N_{fm}$, the number of examinees at m in the focal group; or

- $w_m$ = the relative frequency at m in some reference group.

In practice, $w_m = N_{fm}$ has been used because it gives the greatest
weight to differences in $P_{fm}$ and $P_{rm}$ at those score levels most frequently
attained by the focal group under study. Use of $N_{fm}$ means that STD P-DIF
equals the difference between the observed performance of the focal group on
the item and the predicted performance of selected reference group members
who are matched in ability to the focal group members. This can be derived
very simply,

$$\text{STD P-DIF} = \sum_m N_{fm} (P_{fm} - P_{rm}) / \sum_m N_{fm} = \sum_m N_{fm} P_{fm} / \sum_m N_{fm} - \sum_m N_{fm} P_{rm} / \sum_m N_{fm},$$

$$\text{STD P-DIF} = P_f - P_f^{*},$$

where $P_f^{*}$ is the performance of the focal
group predicted from the reference-group item-test regression curve, $P_{rm}$, or
as suggested above, the predicted performance of selected reference group
members who are matched in ability to the focal group.
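
The contrast between impact and the focal-weighted STD P-DIF can be sketched in Python. The counts reuse the hypothetical Simpson's paradox item from Section 1.1, with group A treated as the focal group and group B as the reference group; that assignment, the tuple layout, and the function names are assumptions made for illustration.

```python
# One tuple per score level m: (N_fm, R_fm, N_rm, R_rm).
levels = [(400, 40, 1000, 200), (1000, 500, 1000, 600), (1000, 900, 400, 400)]


def std_p_dif(levels, weights=None):
    """Standardized p-difference; defaults to focal-group weights w_m = N_fm."""
    if weights is None:
        weights = [n_f for n_f, _, _, _ in levels]
    weighted_sum = sum(
        w * (r_f / n_f - r_r / n_r)          # w_m * (P_fm - P_rm)
        for w, (n_f, r_f, n_r, r_r) in zip(weights, levels)
    )
    return weighted_sum / sum(weights)


def impact(levels):
    """Unmatched difference P_f - P_r, each group weighted by its own counts."""
    p_f = sum(r_f for _, r_f, _, _ in levels) / sum(n_f for n_f, _, _, _ in levels)
    p_r = sum(r_r for _, _, _, r_r in levels) / sum(n_r for _, _, n_r, _ in levels)
    return p_f - p_r


print(round(impact(levels), 3))     # positive: the intact focal group scores higher
print(round(std_p_dif(levels), 3))  # negative: matched focal examinees score lower
```

Passing a different `weights` list (for example the total-group counts) gives the other weighting options listed above without changing the rest of the calculation.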

STD P-DIF is an index that can range from -1 to +1 (or -100% to 100%).
Positive values of STD P-DIF indicate that the item favors the focal group,
while negative STD P-DIF values indicate that the item disadvantages the
focal group. STD P-DIF values between -.05 and +.05 are considered
negligible. STD P-DIF values between -.10 and -.05 and between .05 and .10 are
inspected to ensure that no possible effect is overlooked. Items with STD P-DIF
values outside the {-.10, +.10} range are more unusual and should be
examined very carefully. 
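
The three bands just described can be written as a small lookup. This is only a sketch: the function name and the labels are hypothetical, and the treatment of values falling exactly on .05 or .10 is an assumption, since the text does not say to which band the boundaries belong.

```python
def describe_std_p_dif(value):
    """Map a STD P-DIF value (proportion metric) to one of the three bands."""
    size = abs(value)
    if size <= 0.05:      # assumption: the boundary .05 counts as negligible
        return "negligible"
    if size <= 0.10:      # assumption: the boundary .10 counts as inspect
        return "inspect"
    return "examine very carefully"
```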

A delta metric version of the STD P-DIF index is: 

$$\text{STD D-DIF} = -2.35\,\ln\left\{\frac{P_f^*/(1 - P_f^*)}{P_f/(1 - P_f)}\right\}.$$

STD D-DIF tends to have a smaller variance than MH D-DIF across items, 
and correlates more highly with MH D-DIF than does STD P-DIF across items. 
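
A minimal sketch of the delta-metric index follows, assuming that both Pf and Pf* lie strictly between 0 and 1 so that the odds are defined. The function name and the example proportions are hypothetical.

```python
import math


def std_d_dif(p_f, p_f_star):
    """-2.35 times the log of the odds of Pf* over the odds of Pf."""
    odds_f = p_f / (1.0 - p_f)
    odds_star = p_f_star / (1.0 - p_f_star)
    return -2.35 * math.log(odds_star / odds_f)


# The same .05 shortfall in proportion correct, on a middling and an easy item.
mid_item = std_d_dif(0.50, 0.55)
easy_item = std_d_dif(0.90, 0.95)
```

Both values are negative because the focal group falls below its predicted performance, and the easy item yields the larger delta value even though the difference in proportions is the same.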

3.2 Standard Errors for Standardization's DIF Indices 

The standard errors for the standardization method DIF indices were 
also developed by Holland. The standard error for the focal group weighting 
version of STD P-DIF is 

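$$\text{SE}(\text{STD P-DIF}) = \left\{\frac{P_f(1-P_f)}{N_f} + \text{VAR}(P_f^*)\right\}^{.5},$$

where 

$$\text{VAR}(P_f^*) = \sum_m \frac{N_{fm}^2\,P_{rm}(1-P_{rm})}{N_{rm}\,N_f^2}.$$

The standard error for the focal group weighting version of STD D-DIF is 

$$\text{SE}(\text{STD D-DIF}) = (2.35)\left\{[P_f(1-P_f)N_r]^{-1} + \frac{\text{VAR}(P_f^*)}{P_f^*(1 - P_f^*)}\right\}^{.5},$$

where Nr is the number of examinees in the reference group. 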
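
The standard error of the focal-weighted STD P-DIF can be sketched directly from the first two formulas. The function name and data are hypothetical, and the sketch assumes that every score level contains at least one reference group member; the delta-metric standard error is not attempted here.

```python
import math


def se_std_p_dif(n_f, n_r, p_r, p_f_obs):
    """n_f, n_r: counts per score level; p_r: reference proportions correct
    per level; p_f_obs: overall focal proportion correct, Pf."""
    total_f = sum(n_f)  # Nf
    var_star = sum(
        nf ** 2 * pr * (1.0 - pr) / (nr * total_f ** 2)
        for nf, nr, pr in zip(n_f, n_r, p_r)
    )
    return math.sqrt(p_f_obs * (1.0 - p_f_obs) / total_f + var_star)


# Hypothetical three-level example with 100 focal group members.
se = se_std_p_dif([60, 30, 10], [10, 30, 60], [0.25, 0.55, 0.85], 0.35)
```

In this made-up case the standard error is roughly .099, and most of it comes from the lowest score level, where many focal group members are matched to only ten reference group members.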

3.3 Differential Distractor Functioning, Speededness and Omission 

DIF assessment does not stop with the flagging of an item for statistical 
DIF. In fact, the flagging step can be viewed as just the beginning. The next 
step is to try to understand the reason or reasons for the DIF. Green, Crone, 
and Folk (1989) have developed a log-linear approach for assessing what they 
call differential distractor functioning (DDF). The standardization approach 
to distractor analysis can also be quite helpful. Some of the items identified by 
Green, Crone and Folk will be analyzed from the standardization framework 
below; some of these items are also analyzed in Thissen, Steinberg and 
Wainer (in press) for differential alternative functioning (daf). 

3.3.1 Differential Distractor Functioning. The generalization of the 
standardization methodology to all response options, including omission and 
not reached, is straightforward and is known as standardized distractor 
analysis (Dorans, Schmitt, & Bleistein, 1988, 1989). It is as simple as replacing 
the keyed response with the option of interest in all calculations. For 
example, a standardized response rate analysis on option A would entail 
computing the proportions choosing A (as opposed to the proportions correct) 
in both the focal and reference groups, 

$$P_{fm}(A) = A_{fm}/N_{fm}; \quad P_{rm}(A) = A_{rm}/N_{rm},$$

where Afm and Arm are the number of people in the focal and reference 
groups, respectively, at score level m who choose option A. The next step is 
to compute differences between these proportions, 

$$D_m(A) = P_{fm}(A) - P_{rm}(A).$$

Then these individual score level differences are summarized across score 
levels by applying some standardized weighting function to these differences 
to obtain STD P-DIF(A), 

$$\text{STD P-DIF}(A) = \frac{\sum_m w_m D_m(A)}{\sum_m w_m},$$

the standardized difference in response rates to option A. In a similar fashion 
one can compute standardized differences in response rates for options B, C, 
D, and E, and for non-responses as well. 
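
The option-by-option calculation can be sketched as follows, using focal group counts as the weights. The function name, the data layout (one dictionary of option counts per score level) and the counts themselves are hypothetical.

```python
def std_p_dif_option(option, focal, reference):
    """focal, reference: one dict per score level mapping option -> count."""
    num = 0.0
    den = 0.0
    for f_counts, r_counts in zip(focal, reference):
        n_f = sum(f_counts.values())
        n_r = sum(r_counts.values())
        if n_f == 0 or n_r == 0:
            continue  # no comparison is possible at this level
        d_m = f_counts.get(option, 0) / n_f - r_counts.get(option, 0) / n_r
        num += n_f * d_m
        den += n_f
    return num / den


# Hypothetical counts at two score levels; option D is the key.
focal = [
    {"A": 20, "B": 5, "C": 5, "D": 15, "E": 5},
    {"A": 10, "D": 35, "E": 5},
]
reference = [
    {"A": 10, "B": 10, "C": 10, "D": 60, "E": 10},
    {"A": 10, "B": 5, "C": 5, "D": 170, "E": 10},
]
results = {opt: std_p_dif_option(opt, focal, reference) for opt in "ABCDE"}
```

Because the response proportions within each group sum to one at every score level, the standardized differences sum to zero across all response categories. In this example the negative value on the key is offset almost entirely by a positive value on option A.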

The plots produced by the standardized distractor analyses can be quite 
helpful in trying to interpret DIF data. As an example, consider the plots in 
Figure 5. Portrayed are selected empirical option response curves for an SAT 
antonym item from a disclosed 1984 test form for which the key, distractors 
and DIF information are provided below: 

     STD P-DIF (Option) 

  MA     PR    BLK    PRACTICAL: 
   4            12    (A) difficult to learn 
   0      0      0    (B) inferior in quality 
   1      1      1    (C) providing great support 
  -5    -11    -16    (D) having little usefulness 
   0      0      0    (E) feeling great regret 

As can be seen in the table, standardization identifies DIF on the key (the 
opposite of practical is (D) having little usefulness) for Blacks (BLK STD P-DIF 
= -16%) and Puerto Ricans (PR STD P-DIF = -11%), but only marginally for 
Mexican Americans (MA STD P-DIF = -5%). In addition, the STD 
P-DIF(option) values indicate where the "anti-DIF" may lie, and the plots for 
the Black group corroborate these indications. Clearly, the Black and Puerto 
Rican focal groups are drawn towards (A) difficult to learn, which suggests 
that they have confused the word practical with the word "practice". 

For additional examples, we will use the two SAT items reported by 
Green, Crone and Folk (1989) to exhibit relatively small differential distractor 
functioning, and substantial differential distractor functioning. The 
standardized distractor information for the item with relatively small 
differential distractor functioning is shown below: 

     STD P-DIF (Option) 

  AA   HISP    BLK    DECADENT: 
  -1      0      1    (A) enormously wealthy 
  -4     -1     -1    (B) remarkably charming 
  -3     -2     -2    (C) ruthless 
  -1      0      0    (D) distinctive 
   5      5      3    (E) flourishing 

 .63    .76    .44    MH D-DIF 



This item exhibits marginal positive DIF for all three focal groups, Asian 
Americans (AA), Hispanics (HISP), and Blacks (BLK). Likewise, there is very 
little differential distractor functioning, as measured by the standardization 
method. 

The data for the second item identified by Green, Crone and Folk (1989) 
is more interesting: 

     In some animal species, differences between opposite sexes are 
     so ---- that it is difficult to tell that the male and female 
     are ----. 

     STD P-DIF (Option) 

  AA   HISP    BLK 
   2     10      3    (A) measurable .... distinct 
   1     -2     -3    (B) minute .... similar 
   2      2      4    (C) obvious .... indistinguishable 
  -8    -12     -9    (D) extreme .... related 
   2      1      3    (E) trivial .... identical 

1.09  -1.40  -1.05    MH D-DIF 

There is a moderate level of DIF for all three focal groups on this item; for 
Hispanics, the DIF level is particularly noticeable. The standardized distractor 
information is particularly informative for this focal group, who are drawn 
towards option (A) measurable .... distinct in much greater proportions 
than the matched group of Whites, who are drawn to option (D) extreme .... 
related. It is not clear to us why Hispanics are drawn towards (A), nor why all 
three focal groups exhibit negative DIF on this item. While the distractor 
analysis tells where the "anti-DIF" is, it doesn't tell us why it's there. See 
Thissen, Steinberg, and Wainer (in press) for a daf analysis of this item. See 
Schmitt, Holland and Dorans (in press) for examples in which the 
standardized distractor analysis corroborates DIF hypotheses for Hispanics. 

3.3.2 Differential Speededness. Application of the standardization 
methodology to counts of examinees at each level of the matching variable 
who did not reach the item results in a standardized not-reached difference, 

$$\text{STD P-DIF}(NR) = \frac{\sum_m w_m\,(P_{fm}(NR) - P_{rm}(NR))}{\sum_m w_m}.$$

For items at the end of a separately-timed section of a test, these standardized 
differences provide measurement of the differential speededness of a test. 
Differential speededness refers to the existence of differential response rates 
between focal group members and matched reference group members to 
items appearing at the end of a section. Schmitt and Bleistein (1987) found 
evidence of this phenomenon for Blacks, as compared to a matched group of 
Whites, on analogy items. Schmitt and Dorans (1990) reported that this effect 
was also found for Hispanics. In Dorans, Schmitt and Bleistein (1988), 
differential speededness results for Black, Hispanic and Asian-American focal 
groups, compared to a White reference group, are presented and their 
implications are discussed. In Dorans, Schmitt and Curley (1988), the effects 
of item position on differential speededness and on DIF assessment were 
investigated. This study, which is described in more detail in Schmitt, 
Holland and Dorans (in press), found that excluding examinees who do not 
reach an item from the calculation of the DIF statistic for that item partially 
compensates for the effects of item location on the DIF estimate. 

One implication that the existence of differential speededness has for 
analyzing DIF or DDF is that the matching variable, total score, may be 
contaminated due to differential speededness. Research presently being 
conducted by A. Schmitt and her colleagues may shed light on the seriousness 
of this potential contamination and the efficacy of potential solutions to the 
problem, such as matching on a shortened unspeeded portion of the total test. 
Simulation studies should prove useful here. 

3.3.3 Differential Omission. It should be obvious that standardization can 
also be applied to the study of differential omission. In fact, Schmitt and 
Dorans (1990) report on some of these studies, including one by Rivera and 
Schmitt (1988), who found that while Hispanics as a group omit more than 
Whites on the SAT, Hispanics tend to omit less than Whites of comparable 
ability. This is a clear example of Simpson's paradox in terms of omitting 
behavior, an example which had immediate implications for the type of 
advice that was being offered to Hispanic test-takers. On the basis of the 
marginal distributions, it appeared that Hispanics were omitting more than 
Whites. After conditioning on total test score, it became clear that the 
opposite was true. So we close our discussion of the Mantel-Haenszel and 
standardization methods with another illustration of the need to compare the 
comparable. 
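
The reversal can be reproduced with a small numerical sketch. The counts below are invented purely for illustration and are not the Rivera and Schmitt (1988) data; the function names are likewise hypothetical. The focal group is concentrated at the low score level, where omitting is common in both groups.

```python
def marginal_rate(omits, n):
    """Overall omit rate, ignoring score level."""
    return sum(omits) / sum(n)


def standardized_difference(omits_f, n_f, omits_r, n_r):
    """Focal-weighted average of the omit-rate differences at each level."""
    num = sum(
        nf * (of / nf - orr / nr)
        for of, nf, orr, nr in zip(omits_f, n_f, omits_r, n_r)
    )
    return num / sum(n_f)


# Invented counts at two score levels (low, high).
n_f, omits_f = [80, 20], [24, 1]     # focal: omit rates .30 and .05
n_r, omits_r = [20, 80], [7, 8]      # reference: omit rates .35 and .10

marginal_gap = marginal_rate(omits_f, n_f) - marginal_rate(omits_r, n_r)
matched_gap = standardized_difference(omits_f, n_f, omits_r, n_r)
```

Here the focal group omits more often overall (a marginal gap of +.10) yet less often than matched reference group members at every score level (a standardized gap of -.05).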

4. Mantel-Haenszel and Standardization From a Common Framework 

Up to now, the Mantel-Haenszel method and the standardization 
method have been described from the frameworks from which they 
evolved: Mantel-Haenszel as a powerful statistical test of the constant odds 
ratio model, and standardization as a non-parametric alternative to item 
response theory for describing item-ability regressions. The two procedures, 
however, share a common framework spelled out in Dorans (1989). 

For rights-scored tests, the standardization definition of null DIF is in 
terms of zero p-differences at all levels of the matching variable, 

$$R_{fm}/N_{fm} - R_{rm}/N_{rm} = 0, \quad m = 1, \ldots, M.$$

The definition of null DIF for Mantel-Haenszel is 

$$[R_{rm}/W_{rm}] / [R_{fm}/W_{fm}] = 1, \quad m = 1, \ldots, M.$$

When null DIF holds, the standardization definition can be rearranged as: 

$$R_{fm}/N_{fm} = R_{rm}/N_{rm},$$

$$R_{fm}N_{rm} = R_{rm}N_{fm},$$

$$R_{fm}(W_{rm} + R_{rm}) = R_{rm}(W_{fm} + R_{fm}),$$

$$R_{fm}W_{rm} + R_{fm}R_{rm} = R_{rm}W_{fm} + R_{fm}R_{rm},$$

$$R_{fm}W_{rm} = R_{rm}W_{fm},$$

$$R_{rm}/W_{rm} = R_{fm}/W_{fm},$$

which becomes the Mantel-Haenszel definition of null DIF, 

$$[R_{rm}/W_{rm}] / [R_{fm}/W_{fm}] = 1, \quad m = 1, \ldots, M.$$

Mantel-Haenszel and standardization share a common definition of null DIF 
that is stated in different metrics. The two procedures differ with respect to 
how they measure departures from null DIF. 

Under rights scoring for the items in which responses are coded as 
either correct or incorrect (including omissions), both the standardization 
procedure and the Mantel-Haenszel procedure use the same basic data to 
focus on differences in conditional item performance, which can be 
operationalized as differences in non-parametric item test regressions 
(standardization) or in terms of a constant odds ratio model (Mantel- 
Haenszel). As we have seen earlier, counts of rights and wrongs on each item 
can be arranged into a 2(groups)-by-2(item scores)-by-M(score levels) 
contingency table for each item being studied. 

The Mantel-Haenszel and standardization procedures operate on the 
basic data of the 2(groups)-by-2(item scores)-by-M(score levels) contingency 
table in different ways. As a consequence, they measure departures from the 
null DIF condition in slightly different ways. 

The first difference in how the two procedures measure departures 
from null DIF is in the metric for defining DIF. Standardization uses 
differences in conditional proportions correct, 

$$D_m = P_{fm} - P_{rm},$$

while Mantel-Haenszel uses conditional odds ratios, 

$$\alpha_m = [R_{rm}/W_{rm}]/[R_{fm}/W_{fm}] = [R_{rm}W_{fm}]/[R_{fm}W_{rm}].$$

The second difference in DIF measurement is in the choice of weights 
used to average the Dm or the αm across levels of the matching variable. The 
Mantel-Haenszel approach uses weights that are nearly optimal statistically 
for testing a constant odds-ratio model. These weights are: 

$$MH_m = W_{rm}R_{fm}/N_{tm},$$

such that 

$$\alpha_{MH} = \frac{\sum_m MH_m\,\alpha_m}{\sum_m MH_m}.$$

In contrast, the weights employed in the standardization approach are not 
defined statistically. Instead they may be chosen to suit the needs of a 
particular investigator. This flexibility has not been utilized often. Instead the 
intuitively appealing focal group frequency distribution, which was employed 
by Dorans and Kulick (1983) in their original work on the SAT, is typically 
used to describe departures from null DIF, 

$$\text{STD P-DIF} = \frac{\sum_m N_{fm}(P_{fm} - P_{rm})}{\sum_m N_{fm}}.$$

Holland and Thayer (1988) pointed out that Cochran (1954) developed a set of 
weights for the p-difference metric that are statistically motivated, that is, they 
are appropriate for testing a constant difference model across score levels. 
These weights are: 

$$C_m = N_{rm}N_{fm}/N_{tm}.$$
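
The three weighting schemes can be compared on the same 2-by-2-by-M table. The sketch below is illustrative only: the helper names, the dictionary keys and the counts are hypothetical, and it assumes that Rfm and Wrm are nonzero at every level so that each conditional odds ratio is defined.

```python
def level_stats(r_r, w_r, r_f, w_f):
    """Per-level quantities from rights and wrongs in each group."""
    n_r, n_f = r_r + w_r, r_f + w_f
    n_t = n_r + n_f
    return {
        "d": r_f / n_f - r_r / n_r,           # Dm = Pfm - Prm
        "alpha": (r_r * w_f) / (r_f * w_r),   # conditional odds ratio
        "w_mh": w_r * r_f / n_t,              # Mantel-Haenszel weight MHm
        "w_std": n_f,                         # focal group weight Nfm
        "w_cochran": n_r * n_f / n_t,         # Cochran weight Cm
    }


def weighted_mean(levels, value, weight):
    total = sum(lv[weight] for lv in levels)
    return sum(lv[weight] * lv[value] for lv in levels) / total


# Hypothetical counts (Rrm, Wrm, Rfm, Wfm) at two score levels.
tables = [(20, 20, 10, 20), (40, 10, 20, 10)]
levels = [level_stats(*t) for t in tables]

alpha_mh = weighted_mean(levels, "alpha", "w_mh")
std_p_dif = weighted_mean(levels, "d", "w_std")
cochran_p_dif = weighted_mean(levels, "d", "w_cochran")
```

These counts were chosen so that the odds ratio is 2 at both levels, so the Mantel-Haenszel average is 2 as well. The conditional p-differences are not constant (about -.17 and -.13), so the focal-weighted and Cochran-weighted averages differ slightly from each other.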

The third difference in DIF measurement between the two methods is 
the metric in which the final statistic is portrayed. Although a delta metric 
version of the standardization DIF statistic has been developed, the primary, 
almost exclusive, metric used by standardization has been the p-metric, even 
for formula-scored tests where an item formula-scored metric would seem 
superior on logical grounds. In contrast, the delta metric has been the metric 
of choice for the Mantel-Haenszel method. One consequence of this 
difference in choice of metrics is that standardization tends to downplay DIF 
in easy and hard items because the p-metric is bounded at both the top and 
bottom. In contrast, the delta metric is unbounded at the extremes, and 
consequently differences for easy and hard items are played up. 

Despite these differences in choice of metric and weighting, 
standardization and Mantel-Haenszel agree very closely with respect to 
measurement of departures from null DIF for the vast majority of items. In 
fact, correlations across items between the two DIF methods in the same 
metric, e.g. delta, are typically close to unity, and slightly higher than 
within-method correlations between metrics, which are in the high nineties. 
Cross-metric cross-method correlations across items are usually in the 
mid-nineties. These correlations indicate that the two methods are measuring 
essentially the same thing, DIF, in slightly different ways: intuitively 
appealing weighting of conditional differences in proportions correct vs. 
statistically-driven weighting of conditional odds ratios. The correlations also 
indicate that the choice of metric for describing the DIF effect may be more 
critical from a practical point of view than the choice of method. 

5. Implementation Issues 



DIF implementation at ETS occurred quickly once Mantel-Haenszel 
was selected as the method for DIF detection and standardization was selected 
as the method for DIF description. With implementation came an 
assortment of issues that required temporary if not permanent resolution. In 
this section of the report, several of these issues are discussed. In the next 
section, future research associated with issues that remain either unsolved or 
only partially solved is discussed. 

5.1 Inclusion of Studied Item 

Holland and Thayer's (1988) analysis of the interrelationship between 
Mantel-Haenszel and the Rasch model led to some counterintuitive 
conclusions about whether an item should be included as part of the criterion 
when DIF analysis is performed on the item. Holland and Thayer (1988) 
concluded on theoretical grounds that an item should be included as part of 
the matching variable: 

     If it is not included, then the MH procedure will not behave 
     correctly when there is no dif according to an IRT model. 
     However the Rasch analysis suggests that the inclusion of the 
     studied item in the matching criterion does not mask the 
     existence of dif, rather it is the inclusion of other items 
     exhibiting dif in the criterion that could lead to the finding that 
     no dif exists for the studied item when in fact it does. (p. 141) 

The need to ensure that other items in the matching criterion are free of DIF 
is one argument for criterion refinement, a procedure described below. 

The mathematical argument for inclusion of the item in the matching 
variable is presented in Holland and Thayer (1988), who also show how 
trivial it is to correct the M 2-by-2 tables for rights-scored tests in which 
number right score is the matching criterion. The correction for formula- 
scored tests, however, is not so trivial. 

5.2 Criterion Refinement or Purification 

An argument that is often voiced against the Mantel-Haenszel 
procedure, the standardization procedure, and other DIF assessment 
techniques that use an internal criterion is the circularity involved in using 
total test score as a criterion for matching. Although not a perfect matching 
criterion because all tests contain a certain amount of statistical noise, scores 
on a test are often the best available matching criterion for several reasons. 
First, the total test score is often a much more reliable measure of what any 
individual item purports to measure. Second, many test scores have 
demonstrated validity for their intended purposes. Third, test scores are 
typically obtained under the same conditions for all examinees. 

Despite these advantages of reliability, validity, and standardized 
administration, test scores are criticized because items are part of the score, 
and there is a concern about the circularity of using potentially biased test 
scores as a criterion for DIF analyses. The most direct way of demonstrating 
that the total test score is acceptable as a matching variable is to demonstrate 
that it is valid for its intended purposes, and that it is equally valid for all 
focal and reference groups. DIF analysis is not a substitute for validity studies. 
In fact, the DIF analysis assumes that the criterion is valid and fair. 

Since all tests are imperfect, they may in fact contain some items which 
do have DIF. Otherwise, the DIF analysis would be a meaningless exercise. In 
an attempt to ensure that the matching criterion is in fact DIF-free, DIF 
analyses at ETS occur in two steps. The first step is called the criterion 
refinement or purification step. Here, items on the matching variable are 
analyzed for DIF, and any items that exhibit sizeable DIF are removed 
regardless of the sign of the DIF. Then this refined criterion is used for 
another DIF analysis of the same items and any other items that were not 
included in the criterion refinement step. 
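
The two-step logic can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the ETS implementation: it uses the focal-weighted STD P-DIF as the flagging statistic in place of the Mantel-Haenszel statistic, the .10 removal threshold is a hypothetical choice (the text says only that items with sizeable DIF are removed), and the studied item is added back into its own matching criterion, in keeping with Section 5.1. Responses are rows of 0/1 item scores and groups are labeled "F" or "R".

```python
def std_p_dif(item, criterion, responses, groups):
    """Focal-weighted STD P-DIF for `item`, matching on the sum of the
    item scores listed in `criterion`."""
    cells = {}  # matching score -> {"F": [n, rights], "R": [n, rights]}
    for row, group in zip(responses, groups):
        score = sum(row[j] for j in criterion)
        cell = cells.setdefault(score, {"F": [0, 0], "R": [0, 0]})
        cell[group][0] += 1
        cell[group][1] += row[item]
    num = den = 0.0
    for cell in cells.values():
        (n_f, r_f), (n_r, r_r) = cell["F"], cell["R"]
        if n_f and n_r:  # use only levels where both groups are present
            num += n_f * (r_f / n_f - r_r / n_r)
            den += n_f
    return num / den if den else 0.0


def two_step_dif(responses, groups, threshold=0.10):
    items = list(range(len(responses[0])))
    # Step 1: analyze every item against the full, unrefined criterion.
    step1 = {j: std_p_dif(j, items, responses, groups) for j in items}
    # Remove items with sizeable DIF, regardless of sign.
    refined = [j for j in items if abs(step1[j]) <= threshold]
    # Step 2: re-analyze every item against the refined criterion,
    # adding the studied item back in if it was removed.
    step2 = {}
    for j in items:
        criterion = sorted(set(refined) | {j})
        step2[j] = std_p_dif(j, criterion, responses, groups)
    return step1, refined, step2
```

Items that were not part of the original matching variable could be passed through the second step in the same way. With very short criteria the matching score separates the groups almost completely, so a sketch like this is only informative on tests of realistic length.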



6. Future Directions 

DIF implementation is in a nascent stage. Much basic research has 
been done, but much more needs to be done. Our methodologies for DIF 
assessment are good, but could be better. In this section, areas for further 
methodological research are identified. These areas fall into three major 
classes: the matching variable; the studied variable; and the group variable. 

6.1 The Matching Variable 

6.1.1 Dimensionality and DIF: The need for multivariate matching. Items 
with sizable DIF are items that behave differently for one group. This 
difference indicates that the identified item does not appear to measure the 
same construct as the total test. Thus DIF measures violations from 
unidimensionality. The unidimensionality of the matching variable is 
central to the DIF assessment process. Shepard (1982) stresses this by saying: 

     ...It should be clear that the assumption of unidimensionality 
     underlying all of the (DIF) methods is not merely a 
     statistical prerequisite but it is central to the way in which item 
     bias is defined. (p. 25) 

Later, Shepard (1987) discusses how multidimensionality and DIF interact: 

     It is also generally understood that the various (DIF) 
     procedures function by signalling multidimensionality. 
     Therefore, the statistical indices can detect when subparts of the 
     test are measuring differently for different groups, but are not 
     automatically evidence of bias. To address the issue of bias 
     requires re-examination of the original construct; is the source 
     of multidimensionality some irrelevant difficulty (hence bias) 
     or a valid subdimension of the intended construct. (p. 1) 

From a factor analytic point of view, multidimensionality abounds in 
item data. Each item is a measure of what the total test measures, i.e. what it 
has in common with other items, and what it alone measures, its unique 
item factor. When a test is composed of unidimensional items, as is the case 
for the mathematical portion of the Scholastic Aptitude Test, DIF occurs 
when subgroup differences along the unique item dimension do not reflect 
subgroup differences in developed mathematical ability. When a test is 
measuring multiple dimensions, as is likely to be the case with a science 
achievement test, DIF may reflect unique item factor differences between 
subgroups or the fact that subgroups vary in different ways on the different 
dimensions measured by the test. DIF is a violation of unidimensionality, 
but simple interpretation of DIF requires a unidimensional matching 
variable. See Bleistein and Schmitt (1989), Dorans and Schmitt (1989), Hu and 
Schmitt (1989), Mazzeo (1989), and Morgan (1989) for a series of papers on the 
interplay between DIF assessment and dimensionality. 

A multidimensional matching variable complicates DIF assessment. 
Multivariate matching, however, may provide a solution to the problem of 
multidimensionality. In multivariate matching, examinees are matched on 
more than one variable. For example, a general developed ability test might 
be composed of verbal reasoning and mathematics items. Matching on a total 
score might reveal that the verbal items exhibit positive DIF for females, 
while the mathematics items exhibit negative DIF. One option is to perform 
separate DIF analyses for the verbal items and for the mathematics items, as is 
now done with the SAT. Another option is to match on both the verbal score 
and the mathematics score prior to comparing how the items function in 
both groups. 

Multivariate matching can have heavy data requirements because of 
the need to cross the levels of all the variables that go into the match. In 
addition, data may be sparse for many combinations of the two or more 
variables especially if they are highly correlated. Where data are sparse, 
separate analyses against more unidimensional criteria, e.g. math items 
against a math score, and verbal items against a verbal score, may be the only 
practical option. Methods such as propensity score matching (Rosenbaum & 
Rubin, 1985) may be a useful solution when data are sparse. 
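
A minimal sketch of matching on two scores at once is given below, assuming records of the hypothetical form (group, verbal score, mathematics score, item score) with groups labeled "F" and "R". Alongside the focal-weighted index it returns the share of focal group members who fall in cells that also contain reference group members, which gives a rough indication of how sparse the crossed table is. Propensity score matching is not attempted here.

```python
from collections import defaultdict


def std_p_dif_two_scores(records):
    """Standardize within cells defined by the pair (verbal, mathematics)."""
    cells = defaultdict(lambda: {"F": [0, 0], "R": [0, 0]})
    for group, verbal, mathematics, item in records:
        cell = cells[(verbal, mathematics)]
        cell[group][0] += 1       # examinees in this cell
        cell[group][1] += item    # number answering the item correctly
    num = matched_focal = total_focal = 0.0
    for cell in cells.values():
        (n_f, r_f), (n_r, r_r) = cell["F"], cell["R"]
        total_focal += n_f
        if n_f and n_r:           # cell is usable only if both groups appear
            num += n_f * (r_f / n_f - r_r / n_r)
            matched_focal += n_f
    if matched_focal == 0:
        return None, 0.0
    return num / matched_focal, matched_focal / total_focal
```

A low coverage value would suggest that the separate analyses against more unidimensional criteria described above are the safer option.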

6.1.2 Inclusion of studied item for formula-scored tests. As mentioned 
earlier, Holland and Thayer (1988) demonstrated how easy it is to adjust the 
MH calculations for inclusion of the studied item in the matching criterion 
when the matching criterion is a number-right score. Inclusion of the studied 
item with a formula-scored criterion is not at all straightforward because it is 
not a simple matter to adjust the matching variable after the formula score 
has been rounded to integer format. As a consequence, some peculiar 
practices have evolved with DIF analyses for formula-scored tests. For 
example, analyses of studied items that are external to the matching variable, 
e.g. pretest items collected in the non-operational section of the SAT, are done 
against a rights-scored criterion despite the fact that the test was administered 
under formula-scored conditions. Under formula-scored conditions, 
omitting an item is different from getting it incorrect. Under rights-scored 
conditions, omitting an item is treated the same as getting the item incorrect. 
So in order to include the pretest item in the criterion, the matching variable 
is scored in a manner that is inconsistent with test administration conditions. 

One potential solution to this problem is to employ multivariate 
matching on rights, wrongs, omits and not reached (presently examinees who 
do not reach the item are excluded from the calculation of the DIF statistic). 
Another option is to use a version of formula scoring in which a correct 
response is assigned a score equal to the number of response options, an omit 
or not reached is assigned a one, and a wrong is assigned a zero. Under this 
type of formula scoring, there are no fractions, and hence no need to round to 
integer format. Hence the adjustment for inclusion may be as simple as it is 
for rights scored tests. 
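
A short derivation may help show why this alternative scoring avoids the rounding problem. The notation is an assumption made for illustration: an n-item test with k options per item, on which an examinee has R rights, W wrongs, and O omitted or not-reached items, so that R + W + O = n. The conventional formula score is taken to be FS = R - W/(k-1), and S denotes the integer score just described.

```latex
\begin{aligned}
S &= kR + O \\
  &= kR + (n - R - W) \\
  &= (k-1)R - W + n \\
  &= (k-1)\left(R - \frac{W}{k-1}\right) + n \\
  &= (k-1)\,FS + n .
\end{aligned}
```

Under these assumptions S is a linear function of the unrounded formula score, so it orders examinees in the same way while taking only integer values.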

6.2 Studied Variable 

6.2.1 Formula score DIF. It has been a common practice to rights score items 
for the purpose of item analysis regardless of the conditions under which the 
item was administered. For rights scored tests, this is a perfectly reasonable 
practice. For formula-scored tests, however, rights scoring of the item is not 
consistent with the conditions under which the item was administered. Had 
the examinee known an item was to be rights scored, it is unlikely he would 
have omitted that item since omitting is tantamount to getting the item 
wrong on a rights scored test. 

The DIF computer programs used at ETS employ rights scoring of items 
for both Mantel-Haenszel and standardization to obtain MH D-DIF, MH P- 
DIF, STD P-DIF, and STD D-DIF. In addition, the program can be asked to 
compute a little-used standardization statistic, which may in fact be the best 
standardization statistic to use for formula-scored tests, such as the SAT, 

$$\text{STD FS-DIF} = \frac{\sum_m w_m (E_{fm} - E_{rm})}{\sum_m w_m} = \frac{\sum_m w_m D_m}{\sum_m w_m},$$

where instead of scoring the item 1 if correct and 0 if incorrect or omit, which 
yields STD P-DIF, the item is scored 1 if correct, 0 if omit, and -1/(k-1) if 
incorrect, where k is the number of response options. Under this type of 
scoring, the expected item performance in the focal group at score level m is 

$$E_{fm} = \{R_{fm}(1) + O_{fm}(0) + W_{fm}(-1/(k-1))\}/N_{fm},$$

where Rfm, Wfm, and Ofm are counts of the number right, the number 
wrong, and the number of omits, respectively, at score level m in the focal 
group and Nfm is the sum of Rfm, Wfm, and Ofm. Likewise, for the 
reference group, we have 

$$E_{rm} = \{R_{rm}(1) + O_{rm}(0) + W_{rm}(-1/(k-1))\}/N_{rm}.$$

Unlike STD P-DIF, STD FS-DIF does not range from -1 to +1. Instead, its 
theoretical range is -k/(k-1) to +k/(k-1). Under no omitting, which is likely 
for easy items, STD FS-DIF = [k/(k-1)] STD P-DIF. These two standardization 
indices are more likely to diverge when items are difficult and omitting 
becomes a dominant behavior. 
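
The relationship between the two indices can be checked numerically. In the sketch below the function names and the (right, wrong, omit) counts are hypothetical, and focal group counts serve as the weights.

```python
def formula_mean(right, wrong, omit, k):
    """Expected formula-scored item performance at one score level."""
    n = right + wrong + omit
    return (right - wrong / (k - 1)) / n


def proportion_right(right, wrong, omit):
    """Expected rights-scored item performance at one score level."""
    return right / (right + wrong + omit)


def standardize(focal, reference, score):
    """Focal-weighted average of score(focal) - score(reference) by level."""
    num = den = 0.0
    for f, r in zip(focal, reference):
        n_f = sum(f)
        num += n_f * (score(*f) - score(*r))
        den += n_f
    return num / den


k = 5  # number of response options
reference = [(20, 20, 0), (35, 5, 0)]       # (right, wrong, omit) per level
focal_no_omits = [(10, 30, 0), (30, 10, 0)]
focal_omits = [(10, 10, 20), (30, 5, 5)]    # same rights, some wrongs now omits


def fs(r, w, o):
    return formula_mean(r, w, o, k)


std_p = standardize(focal_no_omits, reference, proportion_right)
std_fs = standardize(focal_no_omits, reference, fs)
std_p_omits = standardize(focal_omits, reference, proportion_right)
std_fs_omits = standardize(focal_omits, reference, fs)
```

Without omits the formula-scored index equals k/(k-1) times STD P-DIF. When some focal group wrongs are replaced by omits, STD P-DIF is unchanged, because the rights are the same, while STD FS-DIF moves toward zero.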

6.2.2 Testlet DIF. Most DIF assessment procedures are just that: differential 
item functioning procedures; the item is the unit of analysis. Some 
differential functioning issues are better answered at a larger level of analysis, 
such as performance on a set of reading passage items, or performance on a 
set of items of comparable content. Here, the unit of analysis shifts to the 
testlet (Wainer & Kiely, 1987). Special types of testlets called item parcels have 
been useful in dimensionality assessment (Dorans & Lawrence, 1987). 
Wainer and Lewis (1990) have shown other areas where testlet-level analysis 
has also proved superior to item-level analysis. Dorans and Lawrence (1987) 
argue that parcel (testlet) analysis may be preferable to item analysis because 
testlets are more reliable indicators than single items. There exists a need to 
develop and try out procedures for Testlet DIF, or DTF to be exact. Some 
promising possibilities are the flexible standardization method (Dorans & 
Kulick, 1986), IRT-based models developed by Thissen and his colleagues, and 
linear regression procedures. 

The standardization approach could be readily adapted to testlet DIF by 
replacing expected item performance with expected testlet performance in the 
basic standardization equation. This would result in a comparison of 
empirical testlet-test regressions, using a standard weighting function to 
produce numerical indices that describe how far apart these regressions are 
for some standardization population. 
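
One way to write this adaptation, offered only as a sketch, is shown below. The label STD T-DIF and the symbols are hypothetical rather than established notation: the two barred terms denote the mean testlet scores of focal and reference group members at level m of the matching variable, and the weights are the same standard weights used in the item-level index.

```latex
\text{STD T-DIF} = \frac{\sum_m w_m \left(\bar{T}_{fm} - \bar{T}_{rm}\right)}{\sum_m w_m}
```

Unlike the item-level index, the range of such a statistic would depend on the number of items in the testlet and on how they are scored.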

As the number of items in a testlet increases, it becomes more likely that 
the testlet-test regression will be linear, provided item difficulties are 
somewhat spread out among items defining the testlet. In that case, a 
comparison of linear regressions would be possible. 

6.3 Group Variable 

6.3.1 Melting-pot DIF. Hu and Dorans (1989) recently found that removal of 
an item that was flagged for positive gender DIF lowered females' scores 
slightly and raised males' scores slightly, as expected. It also had some 
unintended consequences. It raised the scores of Hispanics and Asian- 
Americans more than it raised male scores. In addition, it raised the scores of 
Hispanic and Asian-American females despite the fact that deletion of this 
item with positive gender DIF reduced the overall female mean score. In 
addition to pointing out that deleting items for DIF can have unintended 
consequences for the groups that were not the focus of analysis, this finding 
demonstrates a flaw with the "marginal DIF analysis" that we do now. 
Instead of crossing gender with ethnicity/race to study DIF, we look at the 
margins, i.e., we do DIF analyses on gender and we do DIF analyses on 
ethnicity/race. This "marginal DIF analysis" ignores potential interactions 
between gender and ethnicity/race, interactions that may be important. One 
possible solution to this problem is to do Melting-Pot DIF analyses in which 
the reference group is the population of all test-takers who meet the 
appropriate grade level and language proficiency criteria, the melting pot 
group. Each gender/ethnic group is a focal group. Melting-Pot DIF would 
permit one to do gender comparisons within ethnic group, as well as ethnic 
group comparisons within gender group. Marginal DIF analyses could be 
obtained, of course, by collapsing across the other margin. One advantage of 
Melting-Pot analysis is that everybody is a focal group member once and a 
reference group member once. Another advantage is that DIF is more 
likely to be found in the smaller subpopulations because they are a smaller 
part of the melting pot. On the negative side, DIF will be harder to find in the 
larger groups, such as White females and White males. One possible solution 
to the problem would be to borrow Wainer's (1989) notions of standardized 
impact, similar to the standardization index, and total standardized impact, 
which can be obtained by weighting the standardized impact by the number of 
individuals in the focal group. This practice, however, might introduce the 
opposite problem: small groups would be ignored. 

6.3.2 Educational Advantage Construct. As DIF implementation moves 
swiftly along at ETS and elsewhere, it is clear that several fundamental issues 
require more attention. Several of these issues have been discussed in this 
section of the report. One very important issue that remains to be discussed is 
that of focal group definition. To date, focal groups have been intact, 
easily-defined groups such as Asians, Blacks, Hispanics and females. Reference 
groups have been Whites or males. It could be argued, however, that these 
intact ethnic groups are merely surrogates for an educational disadvantage 
attribute that should be used in focal group definition. In fact, within any of 
these groups, there is probably a considerable degree of variability with respect 
to educational advantage or disadvantage. Perhaps we should be focusing 
our group definition efforts on defining and measuring educational 
advantage or disadvantage directly. This argument echoes that made more 
than a decade ago in the American Psychologist by Novick and Ellis (1977), 
where a strong case was made for "the explicit identification of those 
attributes that constitute disadvantage, rather than accepting group 
membership as a surrogate for disadvantage" (p. 318), and more recently by 
Schmitt and Dorans (1990). Novick and Ellis acknowledged that the problems 
of understanding what constitutes disadvantage and being able to measure it 
adequately were formidable. They still are. Significant advances in DIF 
implementation, however, may depend on serious efforts that address this 
issue. 



7. Closing Comments 

The major purpose of this report was to present the Mantel-Haenszel 
technique for DIF detection and the standardization technique for DIF 
description. We began by making the important distinction between DIF and 
Impact, pointing out the need to compare the comparable. Then the Mantel- 
Haenszel procedure and the standardization procedure were described in 
some detail in that order. A common framework was used to present 
similarities and dissimilarities between the two methods. Then we discussed 
the relationship of the MH procedure to IRT methods for DIF detection in 
general, and the Rasch model, in particular. Then the use of standardization 
for assessing differential distractor functioning, differential speededness and 
differential omission was presented. 

Several issues in applied DIF analyses were discussed, including 
inclusion of the studied item in the matching variable, and the refinement of 
the matching variable. Future research topics dealing with the matching 
variable, the studied variable and the group variable were discussed. 

Large scale DIF implementation is a relatively new phenomenon in 
the field of measurement. Low-cost, practical, statistically-sound techniques, 
like the Mantel-Haenszel and standardization approaches, have made large 
scale implementation a reality. These are powerful techniques for DIF 
detection and description. As the implementation issues and future direction 
sections of this report indicate, these procedures could be improved and made 
more applicable to the actual testing situation. Although they are sound 
methods for DIF assessment, enhancements can and should be made. The 
major focus of future DIF research efforts, however, should not be on 
methodological enhancements. Although it could be improved, the 
methodology is quite sound. Future research should focus on trying to 
uncover testable, verifiable, robust explanations for why DIF occurs when it 
does. As Schmitt, Holland and Dorans (in press) reveal, this will not be an 
easy task, partly because DIF is usually small relative to other item properties 
such as difficulty, partly because DIF research is limited by many practical 
and ethical constraints, and partly because DIF, like bias, is a political issue, as 
well as an issue that is laden with emotional overtones. The major challenge 
facing the DIF field is to take the methods described in this report or the 
methods described in Thissen, Steinberg and Wainer (in press) and use them 
to identify replicable DIF, generate sound hypotheses about this replicable DIF, 
test these hypotheses under controlled conditions, and develop guidelines for 
producing future tests that are free from these irrelevant sources of group 
differences. 